Introduction to Triton Programming: The Performance Paradox: Why Correct Code Runs Slowly

The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can perform worse than a plain loop on the CPU if it fails to amortize the fixed costs of the GPU hardware. These costs typically show up as a launch tax.

1. The "Correctness" Fallacy

Functional correctness does not imply performance. Even if your Triton code correctly distributes work across thousands of threads, a small total workload (N) leaves the GPU underutilized: the hardware spends more time on state transitions than on actual computation.
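To make the underutilization concrete, here is a minimal sketch that counts how many program instances a 1D grid would contain for different N. The values `NUM_SMS` and `BLOCK_SIZE` and the helper names are assumptions chosen for illustration, not taken from this lesson; the ceiling division mirrors what `triton.cdiv` does when sizing a grid.

```python
import math

# Hypothetical numbers for illustration only; check your own GPU's specs.
NUM_SMS = 108        # assumed SM count (an NVIDIA A100, for example, has 108)
BLOCK_SIZE = 1024    # assumed number of elements handled by one program


def programs_launched(n_elements: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of program instances in a 1D grid: ceil(N / BLOCK_SIZE)."""
    return math.ceil(n_elements / block_size)


def sm_coverage(n_elements: int) -> float:
    """Rough upper bound on the fraction of SMs that get at least one program."""
    return min(1.0, programs_launched(n_elements) / NUM_SMS)


for n in (256, 100_000, 100_000_000):
    print(f"N={n:>11,}: {programs_launched(n):>6} programs, "
          f"SM coverage <= {sm_coverage(n):.0%}")
```

Under these assumptions, N=256 yields a single program, so at most one SM does any work while the launch cost is paid in full.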

2. The Python Timing Trap

Benchmarking GPU code from Python with time.time() is dangerous because GPU launches are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the time spent enqueueing the work. With synchronization, you measure the host-to-device latency, which for small kernels is often as much as 10 times longer than the kernel's own execution.
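The effect can be reproduced without a GPU. The sketch below is only an analogy: `FakeDevice` is a hypothetical stand-in for a GPU stream, built from a standard-library queue and a worker thread, and `KERNEL_SECONDS` is an assumed duration. It shows why a timer that stops right after the launch sees almost nothing.

```python
import queue
import threading
import time

KERNEL_SECONDS = 0.05  # assumed duration of the pretend "kernel"


class FakeDevice:
    """Hypothetical stand-in for a GPU stream: work is queued, then run by a worker."""

    def __init__(self) -> None:
        self._work = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            seconds = self._work.get()
            time.sleep(seconds)        # pretend to execute the kernel
            self._work.task_done()

    def launch(self, seconds: float) -> None:
        self._work.put(seconds)        # returns immediately, like a kernel launch

    def synchronize(self) -> None:
        self._work.join()              # block until all queued work has finished


device = FakeDevice()

start = time.perf_counter()
device.launch(KERNEL_SECONDS)
unsynced = time.perf_counter() - start   # measures only the enqueue

device.synchronize()                     # drain the queue before measuring again

start = time.perf_counter()
device.launch(KERNEL_SECONDS)
device.synchronize()
synced = time.perf_counter() - start     # includes the "kernel" itself

print(f"without synchronize: {unsynced * 1e3:.3f} ms")
print(f"with synchronize:    {synced * 1e3:.3f} ms")
```

On real hardware, the corresponding pattern is to call torch.cuda.synchronize() before reading the clock. Timing with `torch.cuda.Event(enable_timing=True)` or `triton.testing.do_bench` is a common alternative that avoids including host-side overhead in the measurement.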

3. Latency vs. Throughput

To fix this, you need enough work to "hide" the launch latency. This marks the transition from a latency-bound regime (limited by CPU-GPU communication) to a throughput-bound regime (limited by the GPU's memory bandwidth or compute).
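A back-of-the-envelope model helps locate that transition for $out = x + y$. The sketch below assumes a fixed launch overhead and a peak memory bandwidth; both constants are round illustrative numbers, not measurements, and the function names are hypothetical. It treats kernel time as launch overhead plus the time to move 12 bytes per float32 element (two loads and one store).

```python
# All constants are assumed round numbers for illustration, not measurements.
LAUNCH_OVERHEAD_S = 10e-6      # assumed fixed cost per kernel launch (10 microseconds)
MEM_BANDWIDTH_BPS = 1.0e12     # assumed GPU memory bandwidth (1 TB/s)
BYTES_PER_ELEMENT = 3 * 4      # out = x + y in float32: 2 loads + 1 store
FLOPS_PER_ELEMENT = 1          # a single addition per element


def transfer_time(n: int) -> float:
    """Seconds spent moving data if memory bandwidth were the only limit."""
    return n * BYTES_PER_ELEMENT / MEM_BANDWIDTH_BPS


def kernel_time(n: int) -> float:
    """Simple additive model: fixed launch cost plus memory traffic."""
    return LAUNCH_OVERHEAD_S + transfer_time(n)


def launch_fraction(n: int) -> float:
    """Share of total time that is pure launch overhead."""
    return LAUNCH_OVERHEAD_S / kernel_time(n)


# N at which launch overhead and memory traffic cost the same amount of time.
crossover_n = LAUNCH_OVERHEAD_S * MEM_BANDWIDTH_BPS / BYTES_PER_ELEMENT

# Arithmetic intensity in FLOPs per byte; low values indicate a memory-bound kernel.
arithmetic_intensity = FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT

for n in (256, 10**6, 10**8):
    print(f"N={n:>11,}: launch overhead is {launch_fraction(n):.1%} of total time")
print(f"crossover at roughly N = {crossover_n:,.0f} elements")
print(f"arithmetic intensity = {arithmetic_intensity:.3f} FLOP/byte")
```

With these assumed constants, a launch with N=256 is almost entirely overhead, while at N=10^8 the overhead falls below one percent and memory bandwidth becomes the limit. Real crossover points depend on the hardware and driver, so treat the numbers as a way to reason rather than as predictions.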


QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.